0.1 Setting up the environment

library(readr)
library(tidyverse)
library(h2o)
library(car)
library(highcharter)
library(boot)
source("/Users/deyapple/Documents/Courses/Term04/DA/Homework/hw_07/images/unbalanced_functions.R")
mdata <- read_csv("/Users/deyapple/Documents/Courses/Term04/DA/Homework/hw_07/images/murder_suicide.csv")

1 Q1

1.1 Choosing A Non-redundant Subset

murder <- mdata %>%
  select(ResidentStatus, 
         Education = Education2003Revision, 
         Sex, 
         Age = AgeRecode52, 
         PlaceOfDeathAndDecedentsStatus, 
         MaritalStatus, 
         MethodOfDisposition, 
         MannerOfDeath, 
         Autopsy, 
         ActivityCode, 
         PlaceOfInjury, 
         Race = RaceRecode5)

After selecting a subset of variables, I examined every variable containing unknown values. Depending on the number of unknown observations, I either deleted those observations or replaced them with the mean.

1.2 Unknown Values

1.2.1 Unknown Education

The number of unknown observations is high, so rather than dropping them I replaced them with the (rounded) mean of the known values.

nrow(murder %>%
       filter(Education == 9))
[1] 1162
meanedu <- mean(murder %>%
                  filter(Education != 9) %>%
                  select(Education) %>%
                  unlist() %>%
                  unname())
murder <- murder %>%
  mutate(Education = ifelse(Education == 9, round(meanedu), Education))
murder$Education = as.integer(murder$Education)

1.2.2 Unknown Age

nrow(murder %>%
       filter(Age == 52))
[1] 16
murder <- murder %>%
  filter(Age != 52)

1.2.3 Unknown PlaceOfDeathAndDecedentsStatus

nrow(murder %>%
       filter(PlaceOfDeathAndDecedentsStatus == 9))
[1] 162
murder <- murder %>%
  filter(PlaceOfDeathAndDecedentsStatus != 9)

1.2.4 Unknown MaritalStatus

nrow(murder %>%
       filter(MaritalStatus == "U"))
[1] 630
murder$MaritalStatusFactor = as.integer(as.factor(murder$MaritalStatus))
meanm <- mean(murder %>%
                filter(MaritalStatusFactor != 4) %>%
                select(MaritalStatusFactor) %>%
                unlist() %>%
                unname())
murder <- murder %>%
  mutate(MaritalStatus = ifelse(MaritalStatus == "U", "M", MaritalStatus), 
         MaritalStatusFactor = ifelse(MaritalStatusFactor == 4, round(meanm), MaritalStatusFactor))

1.2.5 Unknown MethodOfDisposition

nrow(murder %>%
       filter(MethodOfDisposition == "U"))
[1] 5627
murder$MethodOfDispositionFactor = as.integer(as.factor(murder$MethodOfDisposition))
meand <- mean(murder %>%
                filter(MethodOfDispositionFactor != 7) %>%
                select(MethodOfDispositionFactor) %>%
                unlist() %>%
                unname())
murder <- murder %>%
  mutate(MethodOfDisposition = ifelse(MethodOfDisposition == "U", "C", MethodOfDisposition), 
         MethodOfDispositionFactor = ifelse(MethodOfDispositionFactor == 7, round(meand), MethodOfDispositionFactor))

1.2.6 Not Applicable Activity Code

nrow(murder %>%
       filter(ActivityCode == 99))
[1] 1143
murder <- murder %>%
  mutate(ActivityCode = ifelse(ActivityCode == 99, 9, ActivityCode))

1.3 Correlation Matrix

murder.num <- Filter(is.numeric, murder)
hchart(cor(murder.num))
murder.num.cor <- cor(murder.num, method = "spearman")

library(reshape2)  # for melt()
library(scales)    # for muted()
murder.cor.melt = melt(murder.num.cor)

ggplot(murder.cor.melt, aes(Var1, Var2)) + 
  geom_tile(aes(fill = value)) + 
  geom_text(aes(label = round(value, 2)), size = 5, color = "black") +
  scale_fill_gradient2(low = muted("gold"), 
                       mid = "mistyrose", 
                       high = muted("mediumvioletred"), 
                       midpoint = 0) +
  theme(panel.grid.major.x=element_blank(), 
        panel.grid.minor.x=element_blank(), 
        panel.grid.major.y=element_blank(), 
        panel.grid.minor.y=element_blank(),
        panel.background=element_rect(fill="white"),
        axis.text.x = element_text(angle = 90, hjust = 1,vjust=1,size = 15,face = "bold"),
        plot.title = element_text(size=20,face="bold"),
        axis.text.y = element_text(size = 15,face = "bold")) + 
  ggtitle("Correlation Plot") + 
  theme(legend.title=element_text(face="bold", size=14)) + 
  scale_x_discrete(name="") +
  scale_y_discrete(name="") +
  labs(fill="Corr. Coef.")

1.4 Scatter Plot Matrix

scatterplotMatrix(murder.num)  # pass the data itself, not the correlation matrix

2 Q2

2.1 Studying The Effect of A Few Variables

murder <- murder %>% 
  mutate(Suicide = ifelse(MannerOfDeath == 2, 1, 0))

In each of the subsections below, whenever the \(p-value\) generated from the test is significant (less than 0.05), we can reject the null hypothesis that Suicide is independent of the variable being tested (i.e. that different values of the variable do not affect Suicide); otherwise we fail to reject the null hypothesis.
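As a minimal sketch of this decision rule on hypothetical counts (not the actual murder data):

```r
# Toy 2x2 contingency table (hypothetical counts, not from the dataset)
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(Sex = c("F", "M"), Suicide = c("0", "1")))
res <- chisq.test(tab)
res$p.value < 0.05  # TRUE here, so we would reject independence
```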

2.1.1 Gender

Significant \(p-value\).

chisq.test(murder$Sex, murder$Suicide)

    Pearson's Chi-squared test with Yates' continuity correction

data:  murder$Sex and murder$Suicide
X-squared = 23.23, df = 1, p-value = 1.437e-06

2.1.2 Race

Significant \(p-value\).

chisq.test(murder$Race, murder$Suicide)

    Pearson's Chi-squared test

data:  murder$Race and murder$Suicide
X-squared = 15250, df = 3, p-value < 2.2e-16

2.1.3 Education

Significant \(p-value\).

chisq.test(murder$Education, murder$Suicide)

    Pearson's Chi-squared test

data:  murder$Education and murder$Suicide
X-squared = 3558.3, df = 8, p-value < 2.2e-16

2.1.4 Age

Significant \(p-value\).

chisq.test(murder$Age, murder$Suicide)

    Pearson's Chi-squared test

data:  murder$Age and murder$Suicide
X-squared = 6724.3, df = 42, p-value < 2.2e-16

2.1.5 Method Of Disposition

Significant \(p-value\).

chisq.test(murder$MethodOfDisposition, murder$Suicide)

    Pearson's Chi-squared test

data:  murder$MethodOfDisposition and murder$Suicide
X-squared = 4252.3, df = 5, p-value < 2.2e-16

3 Q3

3.1 Fitting A Model Using glm()

Some of the predictors, such as PlaceOfDeathAndDecedentsStatus and Race, can’t be treated as numerical data. I have used as.factor() to account for this.

murder$PlaceOfDeathAndDecedentsStatus = as.factor(murder$PlaceOfDeathAndDecedentsStatus)
murder$Race = as.factor(murder$Race)
fit1 <- glm(data = murder, 
           Suicide ~ Education + Sex + Age + 
             PlaceOfDeathAndDecedentsStatus + MaritalStatus + 
             MethodOfDisposition + Race, 
           family = "binomial")
summary(fit1)

Call:
glm(formula = Suicide ~ Education + Sex + Age + PlaceOfDeathAndDecedentsStatus + 
    MaritalStatus + MethodOfDisposition + Race, family = "binomial", 
    data = murder)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9245  -0.4615   0.3807   0.6142   2.5920  

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -2.943902   0.142377 -20.677  < 2e-16 ***
Education                        0.184284   0.007228  25.495  < 2e-16 ***
SexM                             0.314484   0.027638  11.379  < 2e-16 ***
Age                              0.076176   0.003903  19.517  < 2e-16 ***
PlaceOfDeathAndDecedentsStatus2 -0.411428   0.043396  -9.481  < 2e-16 ***
PlaceOfDeathAndDecedentsStatus3  0.161772   0.085141   1.900 0.057427 .  
PlaceOfDeathAndDecedentsStatus4  1.357213   0.038659  35.107  < 2e-16 ***
PlaceOfDeathAndDecedentsStatus5 -0.725411   0.175044  -4.144 3.41e-05 ***
PlaceOfDeathAndDecedentsStatus6 -0.688298   0.166279  -4.139 3.48e-05 ***
PlaceOfDeathAndDecedentsStatus7  0.286811   0.037960   7.556 4.17e-14 ***
MaritalStatusM                   0.220729   0.035756   6.173 6.69e-10 ***
MaritalStatusS                  -0.143932   0.036190  -3.977 6.98e-05 ***
MaritalStatusW                  -0.236247   0.061314  -3.853 0.000117 ***
MethodOfDispositionC             0.792471   0.024501  32.345  < 2e-16 ***
MethodOfDispositionD             1.211853   0.285413   4.246 2.18e-05 ***
MethodOfDispositionE             0.078582   0.162566   0.483 0.628822    
MethodOfDispositionO            -0.281884   0.161314  -1.747 0.080563 .  
MethodOfDispositionR             0.039925   0.062053   0.643 0.519961    
Race2                           -2.141663   0.029028 -73.779  < 2e-16 ***
Race3                           -0.398895   0.083023  -4.805 1.55e-06 ***
Race4                           -0.152617   0.068629  -2.224 0.026162 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 70984  on 59800  degrees of freedom
Residual deviance: 48884  on 59780  degrees of freedom
AIC: 48926

Number of Fisher Scoring iterations: 5

For each data point the associated deviance residual is calculated, giving a set of residuals. The first part of the summary (the Deviance Residuals’ Min, 1Q, etc.) is simply a non-parametric description of this distribution.

Taking our model as a whole, the Residual deviance measures its lack of fit, whereas the Null deviance measures the lack of fit of a reduced model consisting only of the intercept.
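These quantities can be read off any glm fit; a self-contained sketch using the built-in mtcars data (a stand-in, not the murder data):

```r
# Null vs. residual deviance on a toy logistic fit (mtcars stand-in)
toy <- glm(am ~ mpg, data = mtcars, family = binomial)
toy$null.deviance                     # intercept-only model's deviance
toy$deviance                          # fitted model's deviance (lower = better)
1 - toy$deviance / toy$null.deviance  # McFadden's pseudo-R^2
```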

Overall, the \(p-values\) seem to be significant, although a few levels of some of the categorical variables have high \(p-values\) (such as MethodOfDisposition). However, it is best not to modify our predictors by deleting the levels with high \(p-values\) while keeping the ones with low \(p-values\).
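Instead of pruning individual levels, a whole categorical predictor can be tested jointly with a likelihood-ratio test via drop1(); a sketch on mtcars (a stand-in for the murder fit):

```r
# Likelihood-ratio test for dropping each term as a whole (mtcars stand-in)
toy <- glm(am ~ mpg + factor(cyl), data = mtcars, family = binomial)
drop1(toy, test = "Chisq")  # one p-value per term, all factor levels jointly
```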

Let’s consider deleting MethodOfDisposition from our model (since 3 out of 5 levels have insignificant \(p-values\)):

fit2 <- glm(data = murder, 
           Suicide ~ Education + Sex + Age + 
             PlaceOfDeathAndDecedentsStatus + MaritalStatus + Race, 
           family = "binomial")
summary(fit2)

Call:
glm(formula = Suicide ~ Education + Sex + Age + PlaceOfDeathAndDecedentsStatus + 
    MaritalStatus + Race, family = "binomial", data = murder)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8105  -0.4762   0.3970   0.6308   2.5323  

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -2.831326   0.141991 -19.940  < 2e-16 ***
Education                        0.150195   0.006972  21.544  < 2e-16 ***
SexM                             0.282516   0.027272  10.359  < 2e-16 ***
Age                              0.092454   0.003883  23.808  < 2e-16 ***
PlaceOfDeathAndDecedentsStatus2 -0.429922   0.042848 -10.034  < 2e-16 ***
PlaceOfDeathAndDecedentsStatus3  0.133277   0.084283   1.581    0.114    
PlaceOfDeathAndDecedentsStatus4  1.371779   0.038217  35.894  < 2e-16 ***
PlaceOfDeathAndDecedentsStatus5 -0.721002   0.173872  -4.147 3.37e-05 ***
PlaceOfDeathAndDecedentsStatus6 -0.699480   0.165010  -4.239 2.24e-05 ***
PlaceOfDeathAndDecedentsStatus7  0.315316   0.037470   8.415  < 2e-16 ***
MaritalStatusM                   0.137952   0.035301   3.908 9.31e-05 ***
MaritalStatusS                  -0.193289   0.035748  -5.407 6.41e-08 ***
MaritalStatusW                  -0.361612   0.060610  -5.966 2.43e-09 ***
Race2                           -2.299078   0.028545 -80.542  < 2e-16 ***
Race3                           -0.641707   0.081656  -7.859 3.88e-15 ***
Race4                           -0.177164   0.067829  -2.612    0.009 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 70984  on 59800  degrees of freedom
Residual deviance: 49997  on 59785  degrees of freedom
AIC: 50029

Number of Fisher Scoring iterations: 5

The AIC value turned out higher than in our previous model, and the second model also has a higher Residual Deviance; therefore fit1 seems more suitable than fit2. (Note that AIC is uninformative for a single model; it can only be used to compare models.)
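The comparison can be done in one call with AIC(); a sketch on mtcars (a stand-in for fit1 and fit2):

```r
# Comparing a full and a reduced logistic fit by AIC (mtcars stand-in)
full    <- glm(am ~ hp + wt, data = mtcars, family = binomial)
reduced <- glm(am ~ wt,      data = mtcars, family = binomial)
AIC(full, reduced)  # lower AIC is preferred; only meaningful as a comparison
```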

3.2 Testing How Well Our Model Fits

3.2.1 Hosmer-Lemeshow Goodness of Fit

The Hosmer-Lemeshow test is a goodness-of-fit test for logistic regression with a binary response variable (Suicide or not). Its output consists of a Hosmer-Lemeshow chi-squared statistic and a \(p-value\) (\(H_0\): the fitted model is correct). Small \(p-values\) mean that the model is a poor fit.

library(ResourceSelection)
hoslem.test(murder$Suicide, fitted(fit1))

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  murder$Suicide, fitted(fit1)
X-squared = 64.351, df = 8, p-value = 6.484e-11
hoslem.test(murder$Suicide, fitted(fit2))

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  murder$Suicide, fitted(fit2)
X-squared = 93.436, df = 8, p-value < 2.2e-16

3.2.2 Diagnostic Plots

 The top right plot is a normality check. The dots are close to the dashed line, so we can conclude that our model satisfies this assumption.

 The top left plot (a linearity check) should be close to the zero line and free of any pattern, which has clearly been violated by our model.

 In the bottom right plot, which is used to identify influential observations, the dots should all stay within the 0.05 range; they are visibly lower, so our model behaves well in this respect.

 Overall, we can conclude that this model is not a good fit.

glm.diag.plots(fit1, glmdiag = glm.diag(fit1))

4 Q4

murder <- murder %>%
  mutate(pred = fitted(fit1))  # fitted() already returns response-scale probabilities

4.1 Density Plot

As you can see, the peaks of the two distributions are far apart; therefore our model does well at distinguishing most of the suicides from the murders. However, towards the right side of the plot we can see that the model could not differentiate between the suicides and the murders.

ggplot(murder) + 
  geom_density(aes(x = pred, fill = as.factor(Suicide)), alpha = 0.5, size = 1) + 
  theme_minimal()

4.2 Histogram

We can see that the difference in mean prediction between suicides and murders is large.

murder %>%
  group_by(Suicide) %>%
  summarise(mean_pred = mean(pred)) %>%  # per-group mean, not the global mean
  ggplot() +
  geom_col(aes(x = as.factor(Suicide), y = mean_pred, fill = as.factor(Suicide))) +
  theme_minimal()

4.3 Scatter Plot + Line Graph

From the graph below we can see that, in contrast to what was shown before, the model did not perform well.

ggplot() + 
  geom_line(aes(x = fit1$linear.predictors, y = fit1$fitted.values)) + 
  geom_point(aes(x = fit1$linear.predictors, y = murder$Suicide)) + 
  theme_minimal()

4.4 Scatter Plot

We can tell that our model performs well for high and low values of prediction, but not so much for the average values.

ggplot(murder) + 
  geom_point(aes(x = Age, y = pred, col = as.factor(Suicide)), size = 1, position = "jitter") + 
  theme_minimal()

5 Q5

5.1 Dividing The Data into Train and Test

murder <- na.omit(murder)

index = sample(x = 1:nrow(murder),
               size = 0.8 * nrow(murder),
               replace = F)
train = murder[index,] 
test =  murder[-index,]

model_glm <- glm(data = train, 
    Suicide ~ Education + Sex + Age + 
             PlaceOfDeathAndDecedentsStatus + MaritalStatus + 
             MethodOfDisposition + Race, 
    family = "binomial")
test$prediction = predict(model_glm, newdata = test, type = "response")
train$prediction = predict(model_glm, newdata = train, type = "response")

5.2 Calculating P, N, TP, …

P <- test %>%
  filter(Suicide == 1) %>%
  nrow()
P
[1] 8633
N <- test %>%
  filter(Suicide == 0) %>%
  nrow()
N
[1] 3328
TP <- test %>%
  filter(Suicide == 1, 
         prediction >= .5) %>%
  nrow()
TP
[1] 8110
TN <- test %>%
  filter(Suicide == 0, 
         prediction < .5) %>%
  nrow()
TN 
[1] 1772
FP <- test %>%
  filter(Suicide == 0, 
         prediction >= .5) %>%
  nrow()
FP
[1] 1556
FN <- test %>%
  filter(Suicide == 1, 
         prediction < .5) %>%
  nrow()
FN
[1] 523
ACC <- (TP + TN) / (P + N)
ACC
[1] 0.8261851
FPR <- 1 - TN / N
FPR
[1] 0.4675481
TPR <- TP / P
TPR
[1] 0.9394185
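These tallies generalise to a small helper (a hypothetical function, not part of the assignment code):

```r
# Hypothetical helper: accuracy, TPR and FPR from actuals and predicted probabilities
class_metrics <- function(actual, prob, cutoff = 0.5) {
  pred <- as.integer(prob >= cutoff)
  TP <- sum(actual == 1 & pred == 1); FN <- sum(actual == 1 & pred == 0)
  TN <- sum(actual == 0 & pred == 0); FP <- sum(actual == 0 & pred == 1)
  c(ACC = (TP + TN) / length(actual),
    TPR = TP / (TP + FN),
    FPR = FP / (FP + TN))
}
class_metrics(c(1, 1, 0, 0), c(0.9, 0.4, 0.6, 0.2))  # ACC 0.5, TPR 0.5, FPR 0.5
```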

5.3 Visualization of FP, FN, TP and TN

5.3.1 Confusion Matrix

cm_info = ConfusionMatrixInfo( data = test, predict = "prediction", 
                               actual = "Suicide", cutoff = .5)
cm_info$plot

5.3.2 Table

table(test$Suicide, ifelse(test$prediction > 0.5, 1, 0)) %>% plot()

6 Q6

accuracy_info = AccuracyCutoffInfo(train = train, test = test, 
                                    predict = "prediction", actual = "Suicide" )
accuracy_info$plot

Since the train and test subsets are sampled anew each time the document is generated, I’ve written the code below to find the cutoff associated with maximum accuracy.

dacc <- accuracy_info$data
dacc.max <- dacc %>%
  mutate(trainmax = (train == max(train)), 
         testmax = (test == max(test)))
test.max <- dacc.max %>%
  filter(testmax)
train.max <- dacc.max %>%
  filter(trainmax)

new test cutoff:

test.cutoff <- test.max$cutoff
test.cutoff
[1] 0.55

new train cutoff:

train.cutoff <- train.max$cutoff
train.cutoff
[1] 0.5

To better visualize the change in accuracy, I have once again drawn a confusion matrix and a table, this time with the (possibly) different cutoff.

cm_info = ConfusionMatrixInfo( data = test, predict = "prediction", 
                               actual = "Suicide", cutoff = test.cutoff)
cm_info$plot

table(test$Suicide, ifelse(test$prediction > test.cutoff, 1, 0)) %>% plot()

7 Q7

7.1 ROC Curve

cost_fp = 100
cost_fn = 200
roc_info = ROCInfo(data = cm_info$data, predict = "predict", 
                     actual = "actual", cost.fp = cost_fp, cost.fn = cost_fn )
grid.draw(roc_info$plot)

The point where the green part of the curve meets the vertical dashed line marks the cutoff with the minimum total cost. (Since false negatives are costed twice as heavily as false positives here, this is not necessarily the maximum-accuracy cutoff.)
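The minimum-cost cutoff can be found by sweeping thresholds and pricing each error type; a sketch with the same 100/200 costs on toy predictions (hypothetical data, not the test set):

```r
# Sweep cutoffs and keep the one minimising 100*FP + 200*FN (toy data)
actual  <- c(1, 1, 1, 0, 0, 0)
prob    <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.1)
cutoffs <- seq(0.05, 0.95, by = 0.1)
cost <- sapply(cutoffs, function(ct) {
  pred <- as.integer(prob >= ct)
  100 * sum(actual == 0 & pred == 1) +  # false positives
  200 * sum(actual == 1 & pred == 0)    # false negatives
})
cutoffs[which.min(cost)]  # cutoff with the lowest total cost
```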

8 Using H2O

library(h2o)
h2o.init()

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/mk/jrl557gd0qz741jsx86f3gjr0000gn/T//RtmpuAaQGK/h2o_deyapple_started_from_r.out
    /var/folders/mk/jrl557gd0qz741jsx86f3gjr0000gn/T//RtmpuAaQGK/h2o_deyapple_started_from_r.err


Starting H2O JVM and connecting: ... Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         5 seconds 18 milliseconds 
    H2O cluster timezone:       Asia/Tehran 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.19.0.4257 
    H2O cluster version age:    19 days  
    H2O cluster name:           H2O_started_from_R_deyapple_ste729 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.78 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.4.3 (2017-11-30) 
hmurder <- as.h2o(murder)

hglm = h2o.glm(y = "Suicide", x= c("Education","Sex","Age", "PlaceOfDeathAndDecedentsStatus", "MaritalStatus", "MethodOfDisposition", 
                                   "Race"),
               training_frame = hmurder, family="binomial",nfolds = 5)


The MSE/RMSE isn’t really high, therefore we can conclude that our model isn’t completely off.

Log Loss quantifies the accuracy of a classifier by penalising false classifications. It is used in many competitions, such as those on Kaggle, and the goal is to minimize it. Since the LogLoss of our model is relatively high, we can conclude that our model does not have high accuracy.
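For reference, log loss can be computed by hand; a minimal sketch (hypothetical helper, not part of the assignment code):

```r
# Minimal log-loss implementation (hypothetical helper)
log_loss <- function(actual, predicted, eps = 1e-15) {
  p <- pmin(pmax(predicted, eps), 1 - eps)  # clip to avoid log(0)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}
log_loss(c(1, 0, 1), c(0.9, 0.1, 0.8))  # confident, correct predictions -> low loss
```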

The AUC (Area Under the Curve) is fairly high (0.84), which counts in the model’s favour, although the other metrics still call its goodness of fit into question.

The \(R^2\) statistic is also low: only 35% of the variance in the response is explained by the model.

Let’s take a look at the part labeled Maximum Metrics. The fourth row gives the cutoff yielding the maximum accuracy: the threshold column (0.530612) is the cutoff at which the maximum accuracy occurs, and this maximum value is shown in the value column (0.820756).

h2o.performance(hglm)
H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.131186
RMSE:  0.3621961
LogLoss:  0.4203796
Mean Per-Class Error:  0.2910992
AUC:  0.8363014
Gini:  0.6726028
R^2:  0.3500981
Residual Deviance:  50278.24
AIC:  50302.24

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0     1    Error          Rate
0      7623  9156 0.545682   =9156/16779
1      1571 41451 0.036516   =1571/43022
Totals 9194 50607 0.179378  =10727/59801

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.395048 0.885431 272
2                       max f2  0.209494 0.937819 328
3                 max f0point5  0.721475 0.866406 160
4                 max accuracy  0.530612 0.820756 231
5                max precision  0.981940 1.000000   0
6                   max recall  0.055379 1.000000 388
7              max specificity  0.981940 1.000000   0
8             max absolute_mcc  0.554530 0.527019 223
9   max min_per_class_accuracy  0.750321 0.763935 150
10 max mean_per_class_accuracy  0.721475 0.766193 160

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

9 Q9

  Overall, the model presented here is, for the various reasons stated in the previous sections, definitely not suitable for use in a real-life court.
  The accuracy of our model (about 0.8) is not high enough to be reliable, and there isn’t enough information (predictors) to decide from.
  The True Positive Rate (TPR) is really high, so our model does well at recognizing actual suicides, but it behaves poorly at recognizing murders: the number of True Negatives is low compared to the total number of negatives, and the False Positive Rate is high (about 0.47). This means that almost half the time, a death that was actually a murder will be declared a suicide, which is equivalent to setting a criminal free. Conversely, a False Negative (a suicide declared a murder, which happens for about 6% of actual suicides) is equivalent to sending an innocent person to prison, and this latter error is much worse: in court it is far preferable to set a criminal free than to imprison an innocent person. Since it is impossible to drive both of these errors (even close to) zero, it is safer not to use our basic model in court, where there is a matter of life and death :). Nevertheless, it can be used to help find a better model with less error.